Nonlinear Policy Gradient Algorithms for Noise-Action MDPs
Authors
Abstract
We develop a general theory of efficient policy gradient algorithms for Noise-Action MDPs (NMDPs), a class of MDPs that generalizes Linearly Solvable MDPs (LMDPs). For finite-horizon problems, these algorithms reduce to simple update equations based on multiple rollouts of the system. We show that our policy gradient algorithms are faster than the PI algorithm, a state-of-the-art policy optimization algorithm. We provide an alternative interpretation of the PI algorithm that further justifies this: the PI algorithm actually performs gradient descent with respect to a risk-seeking objective rather than the desired expected-cost objective. For infinite-horizon problems, we develop algorithms that require estimating only a state value function, rather than a state-action Q function or advantage function. We develop policy gradient algorithms for all MDP formulations (finite horizon, infinite horizon, first exit) and for arbitrary policy parameterizations. We demonstrate the effectiveness of the policy gradient algorithms on simple 2-D nonlinear dynamical systems and on large linear dynamical systems.
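To make the rollout-based updates concrete, below is a minimal Python sketch of a generic finite-horizon Monte Carlo (score-function) policy gradient estimated from multiple rollouts. It only illustrates the general "multiple rollouts" structure mentioned in the abstract; the environment interface (env.reset, env.sample_action, env.step) and all names are hypothetical assumptions, and the paper's NMDP-specific update equations may differ.

import numpy as np

def rollout(env, theta, horizon, rng):
    """Sample one trajectory; return per-step costs and score-function terms."""
    state = env.reset(rng)
    costs, score_grads = [], []
    for _ in range(horizon):
        # sample_action is assumed to return the action and grad_theta log pi(a|s)
        action, grad_log_pi = env.sample_action(theta, state, rng)
        state, cost = env.step(state, action, rng)
        costs.append(cost)
        score_grads.append(grad_log_pi)
    return np.asarray(costs), np.asarray(score_grads)

def policy_gradient_estimate(env, theta, horizon, num_rollouts, rng):
    """Average the score-function gradient of the expected total cost over rollouts."""
    grad = np.zeros_like(theta)
    for _ in range(num_rollouts):
        costs, score_grads = rollout(env, theta, horizon, rng)
        cost_to_go = np.cumsum(costs[::-1])[::-1]  # remaining cost from each step onward
        grad += (score_grads * cost_to_go[:, None]).sum(axis=0)
    return grad / num_rollouts

# Gradient descent on the expected-cost objective:
# theta -= step_size * policy_gradient_estimate(env, theta, horizon, num_rollouts, rng)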
Similar papers
Policy gradients in linearly-solvable MDPs
We present policy gradient results within the framework of linearly-solvable MDPs. For the first time, compatible function approximators and natural policy gradients are obtained by estimating the cost-to-go function, rather than the (much larger) state-action advantage function as is necessary in traditional MDPs. We also develop the first compatible function approximators and natural policy g...
Full text
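For reference, the contrast drawn in this snippet (estimating the cost-to-go rather than a state-action quantity) can be read against the standard policy-gradient theorem, which expresses the gradient of the expected cost through the state-action value or advantage function. The equation below is the generic textbook form, not the specific construction of the cited paper:

\nabla_\theta J(\theta)
  = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}
    \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \right]
  = \mathbb{E}\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi_\theta}(s, a) \right],
\qquad A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - v^{\pi_\theta}(s).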
Risk-Constrained Reinforcement Learning with Percentile Risk Criteria
In many sequential decision-making problems one is interested in minimizing an expected cumulative cost while taking into account risk, i.e., increased awareness of events of small probability and high consequences. Accordingly, the objective of this paper is to present efficient reinforcement learning algorithms for risk-constrained Markov decision processes (MDPs), where risk is represented v...
Full text
Approximate Newton Methods for Policy Search in Markov Decision Processes
Approximate Newton methods are standard optimization tools which aim to maintain the benefits of Newton’s method, such as a fast rate of convergence, while alleviating its drawbacks, such as computationally expensive calculation or estimation of the inverse Hessian. In this work we investigate approximate Newton methods for policy optimization in Markov decision processes (MDPs). We first analy...
Full text
Uniform Convergence of Value Iteration Policies for Discounted Markov Decision Processes
This paper deals with infinite horizon Markov Decision Processes (MDPs) on Borel spaces. The objective function considered, induced by a nonnegative and (possibly) unbounded cost, is the expected total discounted cost. For each of the MDPs analyzed, the existence of a unique optimal policy is assumed. Conditions that guarantee both pointwise and uniform convergence on compact sets of the minimiz...
Full text
A Gauss-Newton Method for Markov Decision Processes
Approximate Newton methods are a standard optimization tool which aims to maintain the benefits of Newton's method, such as a fast rate of convergence, whilst alleviating its drawbacks, such as computationally expensive calculation or estimation of the inverse Hessian. In this work we investigate approximate Newton methods for policy optimization in Markov decision processes (MDPs). We first ana...
Full text